chore: avoid hammering store at ntx builder startup #2064
SantiagoPittella wants to merge 5 commits into
Conversation
/// Bounds the burst of `GetNetworkAccountDetailsById` and `GetUnconsumedNetworkNotes` calls the
/// ntx-builder fires when catching up after a restart.
const DEFAULT_STORE_HYDRATION_CONCURRENCY: NonZeroUsize =
    NonZeroUsize::new(4).expect("literal is non-zero");
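For reference, a minimal sketch of how a limit like this bounds in-flight RPCs via `futures::StreamExt::buffer_unordered`; the account ids and the `fetch_details` stand-in are illustrative, not the node's actual store API:

```rust
use std::num::NonZeroUsize;

use futures::{stream, StreamExt};

// Illustrative only: at most `limit` store calls are in flight at once.
async fn hydrate(account_ids: Vec<u64>, limit: NonZeroUsize) -> Vec<String> {
    stream::iter(account_ids)
        // Map each id to a (not yet polled) future for its store RPC.
        .map(fetch_details)
        // Poll at most `limit` of them concurrently.
        .buffer_unordered(limit.get())
        .collect()
        .await
}

// Stand-in for the real `GetNetworkAccountDetailsById` RPC.
async fn fetch_details(id: u64) -> String {
    format!("details for account {id}")
}
```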
Do we know how slow things could be because of this limit?
Not really, because my local DB isn't big enough to actually measure this. It would be really nice to have a copy of the DB on a server with specs similar to the devnet/testnet ones to compare against. Or at least to run it locally with the testnet database.
Do you know if I can get access to it? Maybe @Mirko-von-Leipzig ?
I do have a 0.13 testnet database snapshot. I can upload it somewhere for you if that helps (but since it's for 0.13 I'm not sure it does...)
yeah, it probably won't be of much use
sergerad left a comment
LGTM but I think we need to make DEFAULT_STORE_HYDRATION_CONCURRENCY configurable even if we do test performance locally. Feels like the type of thing that might need changing in the live environment.
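One possible shape for that knob; a sketch only, where the struct and field names are hypothetical rather than the node's existing config surface:

```rust
use std::num::NonZeroUsize;

/// Hypothetical configuration for the store client; the default mirrors the
/// constant discussed above.
pub struct StoreClientConfig {
    /// Bounds concurrent in-flight startup RPCs during hydration.
    pub store_hydration_concurrency: NonZeroUsize,
}

impl Default for StoreClientConfig {
    fn default() -> Self {
        Self {
            store_hydration_concurrency: NonZeroUsize::new(4)
                .expect("literal is non-zero"),
        }
    }
}
```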
/// ```sql
/// SELECT account_id FROM accounts WHERE transaction_id IS NULL
/// ```
This is going to become a very large list one day.
should we paginate it?
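If pagination does become necessary, a keyset-paginated variant of that query could look like the following sketch; it assumes SQLite via `rusqlite` and an integer `account_id`, both simplifications of the real schema and access layer:

```rust
use rusqlite::{params, Connection, Result};

/// Keyset pagination: each page resumes strictly after the last id seen, so
/// the scan stays index-friendly even as the table grows.
fn inflight_account_ids(conn: &Connection, page_size: i64) -> Result<Vec<i64>> {
    let mut stmt = conn.prepare(
        "SELECT account_id FROM accounts \
         WHERE transaction_id IS NULL AND account_id > ?1 \
         ORDER BY account_id \
         LIMIT ?2",
    )?;
    let mut out = Vec::new();
    let mut last_seen = i64::MIN;
    loop {
        let page: Vec<i64> = stmt
            .query_map(params![last_seen, page_size], |row| row.get(0))?
            .collect::<Result<_>>()?;
        // An empty page means we have drained the result set.
        let Some(&max) = page.last() else { break };
        last_seen = max;
        out.extend(page);
    }
    Ok(out)
}
```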
// For each account that had an inflight row at startup, do a full hydration
// (account-details + unconsumed-notes) from the store. The inflight tx may have
// landed in a block we didn't witness, so locally-committed state cannot be
// trusted for these specific accounts. The set is bounded by `max_concurrent_txs`
// so this is small and we run it sequentially before opening the main loop.
Oh this is neat; I hadn't actually considered this. Nice!
let kickoff_block = if catch_up_started {
    None
} else {
    match &event {
        MempoolEvent::BlockCommitted { header, .. } => {
            Some(header.block_num())
        },
        _ => None,
    }
};
We can simplify this by first waiting for the first block-committed event outside of this main loop, buffering the processed items into a Vec until then.
Once found, kick off the catch-up work, then enter the main loop but chain the Vec with the mempool stream:
let mut events = futures::stream::iter(buffered_events).chain(mempool_events);
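A fuller sketch of that suggestion, with `MempoolEvent` reduced to a stand-in enum and the catch-up call elided:

```rust
use futures::{stream, Stream, StreamExt};

enum MempoolEvent {
    BlockCommitted { block_num: u32 },
    Other,
}

async fn run(mut mempool_events: impl Stream<Item = MempoolEvent> + Unpin) {
    // Buffer everything until the first committed block arrives.
    let mut buffered = Vec::new();
    let kickoff_block = loop {
        match mempool_events.next().await {
            Some(MempoolEvent::BlockCommitted { block_num }) => {
                // Re-buffer the event so the main loop still processes it.
                buffered.push(MempoolEvent::BlockCommitted { block_num });
                break Some(block_num);
            },
            Some(event) => buffered.push(event),
            None => break None,
        }
    };
    let _ = kickoff_block; // kick off the catch-up work from here

    // Replay the buffered events, then continue with the live stream.
    let mut events = stream::iter(buffered).chain(mempool_events);
    while let Some(_event) = events.next().await {
        // main loop body
    }
}
```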
I think this works. But I must admit this is more complex than I anticipated (not an impl fault; overlaps + concurrency are a pain).
I'm wondering if this is worth it. What are some alternate options?
We could revert to always requiring a full sync before continuing on. This is a bit wasteful and means we cannot immediately begin producing network txs, but it does simplify this to: resync from last_sync..=tip and only then begin producing txs. This also simplifies the code quite a bit.
I'm a bit on the fence..
wdyt?
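For comparison, the simpler alternative boils down to something like this sketch, where every name is a stand-in rather than the node's real API:

```rust
/// "Full sync first" startup: no overlap between catch-up and live
/// processing, at the cost of producing no network txs until the resync
/// from last_sync..=tip finishes.
async fn startup(last_sync: u32, tip: u32) {
    for block_num in last_sync..=tip {
        resync_block(block_num).await;
    }
    // Only now enter the main loop and begin producing network txs.
    main_loop().await;
}

async fn resync_block(_block_num: u32) { /* hydrate state for this block */ }
async fn main_loop() { /* process mempool events */ }
```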
I mostly agree. If we can get a testnet snapshot we can start checking (though probably not enough) whether this complexity is worth having or not
BTW, what exactly would we need from the testnet snapshot? We certainly can recover some stats or measure some queries against the 0.13 snapshot, so maybe it does help?
closes #1952
This PR:

- Fetches an `unconsumed_notes` delta per known account, dropping the per-account `GetNetworkAccountDetailsById` call.
- Adds a hydration-concurrency limit to `StoreClient` (defaulted to 4) that bounds concurrent in-flight startup RPCs.
- Adds a `store_sync_checkpoint` column on `chain_state` (required a new SQL migration; sketched below), distinct from the mempool-driven `block_num`.

I tried it by:

- Comparing behavior against the `main` version of the node.
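For the checkpoint column mentioned above, the migration presumably has roughly the following shape; a sketch via `rusqlite`, where the column type and nullability are assumptions and the real migration may differ:

```rust
use rusqlite::{Connection, Result};

/// Hypothetical shape of the `store_sync_checkpoint` migration.
fn add_store_sync_checkpoint(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "ALTER TABLE chain_state ADD COLUMN store_sync_checkpoint INTEGER;",
    )
}
```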